Data Analysis by R

Sample data for R

When I try a method for the first time and want to see how it actually behaves, or when I want to compare methods, I feed in data whose properties I already know.

Tutorial articles often use well-known data sets such as iris, or data generated from random numbers. I use both as well.

This page summarizes the sample data that I often use: data sets included in R, such as iris, which I save as csv files so that they can be tried in other tools, and data that I create myself.

Data included in the library

In the following, each bundled data set is assigned to an object named "Data". The rest of the analysis in R then refers to the data by the name "Data".

By the way, sample code elsewhere often names this object "df".

iris

iris is probably the most famous sample data.

The 1st to 4th columns are quantitative variables, and the 5th column is a qualitative variable.

It can be used for methods that take a qualitative variable as the objective variable, and for methods that take a quantitative variable as the objective variable while handling explanatory variables that are a mixture of quantitative and qualitative variables.

Data <- iris
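
One quick way to confirm which columns are quantitative and which are qualitative is to look at the class of each column. The following is a minimal sketch using only base R; the names col_types, n_quantitative and n_qualitative are chosen here for illustration.

```r
Data <- iris

# Class of each column: "numeric" means quantitative, "factor" means qualitative
col_types <- sapply(Data, class)
print(col_types)

# Count the columns of each kind
n_quantitative <- sum(sapply(Data, is.numeric))
n_qualitative  <- sum(sapply(Data, is.factor))
```

The same two counts can be computed for the other data sets on this page. For data with an ordered factor, such as the Plant column of CO2, class() returns two class names for that column, so is.factor() is the more reliable check.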

CO2

There are 2 columns of quantitative variables and 3 columns of qualitative variables.

Data <- CO2

warpbreaks

There is one column of quantitative variables and two columns of qualitative variables. It is experimental data in a two-way layout: two types of wool and three levels of tension, with the number of warp breaks recorded for each combination.

It can be used to analyze experimental data, or to practice analyzing a quantitative variable that is count data (for example with a Generalized Linear Model).

Data <- warpbreaks
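
As an illustration of the count-data use case, the sketch below fits a Poisson regression with glm(). The model formula (main effects only, no interaction) is an assumption made for this example, not a recommendation for this data.

```r
Data <- warpbreaks

# Poisson GLM: breaks is a count, wool and tension are qualitative variables
fit <- glm(breaks ~ wool + tension, family = poisson, data = Data)

# Coefficients are on the log scale
summary(fit)
```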

longley

There are 7 columns of quantitative variables, which are also time series data.

It includes multicollinearity and can be used to practice data analysis of real-life multivariate data. Since the 6th column is the year, it is easier to use if you remove it.

Data <- longley
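
A minimal sketch of removing the year column and getting a first impression of the multicollinearity from the correlation matrix. Looking at pairwise correlations is only one rough indicator, chosen here for brevity.

```r
Data <- longley[, -6]   # drop the 6th column (Year)

# Pairwise correlations; values close to 1 hint at multicollinearity
cor_matrix <- round(cor(Data), 2)
print(cor_matrix)
```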

airquality

Time series data. You can try Dimensionality reduction analysis of time series data. It also contains missing values (NA).

Since the 5th and 6th columns are the month and day, it is easier to handle after deleting them. If you remove these, you will have data with four quantitative variables.

Data <- airquality
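
The following sketch drops the month and day columns and then checks the missing values. Removing incomplete rows with na.omit() is just one possible way to handle NA, and the name Data_complete is used here for illustration.

```r
Data <- airquality[, -c(5, 6)]   # drop the 5th and 6th columns (Month, Day)

# Number of missing values in each remaining column
print(colSums(is.na(Data)))

# Keep only the rows that have no missing values
Data_complete <- na.omit(Data)
```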

mtcars

All columns are quantitative variables. The row names hold the sample names (car models).

This is useful when you want to try Analysis of Similarity of Samples or Analysis of Many vs Many.

Data <- mtcars
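
The sample names are stored as row names rather than as a column. The sketch below shows how to read them and, as one possible starting point for a similarity analysis, computes distances between samples. Standardizing with scale() first is an assumption made for this example.

```r
Data <- mtcars

# Sample names are the row names of the data frame
print(head(rownames(Data)))

# Euclidean distances between samples after standardizing each column
D <- dist(scale(Data))
```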

LifeCycleSavings

All columns are quantitative variables, and the row names hold the sample names. Usage is similar to mtcars.

Data <- LifeCycleSavings

state.x77

All columns are quantitative variables, and the row names hold the sample names. Usage is similar to mtcars.

Since the data is from each state in the United States, it is easy to imagine the contents of the data.

Data <- state.x77
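
Unlike the data sets above, state.x77 is stored as a matrix, not a data frame. If later code expects a data frame (for example, code that uses "Data$..."), it can be converted as in this sketch. The name Data_df is used here for illustration.

```r
Data <- state.x77
print(class(Data))                 # a matrix, not a data frame

# Convert to a data frame; row names (state names) are kept
Data_df <- as.data.frame(Data)
print(head(rownames(Data_df)))
```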

UScitiesD

It is a distance matrix: the distances between cities in the United States.

It can be used when you want to try a method that starts from a distance matrix, such as Multidimensional scaling.

Data <- UScitiesD
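
As a sketch of starting from the distance matrix, classical multidimensional scaling is available in base R as cmdscale(). Reducing to two dimensions (k = 2) is a choice made here so that the result can be drawn on a scatter plot.

```r
Data <- UScitiesD

# Classical multidimensional scaling into 2 dimensions
coords <- cmdscale(Data, k = 2)

# One row per city, one column per dimension
print(coords)
```

Plotting the two columns of coords against each other, with rownames(coords) as labels, gives a map-like arrangement of the cities.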

eurodist

It is a distance matrix. This is the distance between European cities.

Data <- eurodist
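
A distance object of this kind stores only one triangle of the matrix. If you need the full square table, for example to view it or to pass it to a tool that expects a rectangular layout, as.matrix() expands it. This is a minimal sketch; the name M is used here for illustration.

```r
Data <- eurodist

# Expand the distance object into a full symmetric matrix
M <- as.matrix(Data)

print(dim(M))
print(M[1:3, 1:3])
```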

Self-made sample data

I often use EXCEL to create my own data for experiments. With EXCEL it is easy to create data while adding a few outliers or checking the state of the data on a scatter plot.

The following are examples using R. If you create the data in R, you can do all the work in the R editor.

Create a sequence of random numbers that follow a normal distribution

This is a method to create 100 random numbers with an average of 10 and a standard deviation of 2. In the data frame named Data, there will be a variable named X1.

If you calculate the mean or standard deviation of the random numbers you created, they will not be exactly 10 or 2, only close to those values, and they change every time you run the code. If this bothers you, generate about 100,000 values instead. This is not a complete solution, but the remaining error is too small to change the conclusions of an analysis.

X1 <- 10 + 2 * rnorm(100)
Data <- as.data.frame(X1)
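
If the run-to-run variation is the problem, fixing the random seed with set.seed() makes the same numbers come out every time. The sketch below also shows that rnorm() accepts the mean and standard deviation directly. The seed value 123 and the name X1_alt are arbitrary choices for this example.

```r
set.seed(123)                      # fix the seed so the result is reproducible
X1 <- 10 + 2 * rnorm(100)
Data <- as.data.frame(X1)

# Close to 10 and 2, but not exactly
print(mean(Data$X1))
print(sd(Data$X1))

# Same numbers, written with the mean and sd arguments of rnorm()
set.seed(123)
X1_alt <- rnorm(100, mean = 10, sd = 2)
```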

Create two uncorrelated variables

Create a second variable, X2, of random numbers that has no correlation with X1 and has a mean of 20 and a standard deviation of 3.

I wrote that there is no correlation, but the sample correlation coefficient does not become exactly 0. The larger the number of samples, the closer it tends to get to 0.

n <- 100
X1 <- 10 + 2 * rnorm(n)
Data <- as.data.frame(X1)
Data$X2 <- 20 + 3 * rnorm(n)
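
To see how the sample correlation behaves as the number of samples grows, the following sketch repeats the same construction for several sample sizes. The sizes, the seed, and the helper name sample_cor are arbitrary choices for this example.

```r
set.seed(1)

# Sample correlation of two independent normal variables for a given n
sample_cor <- function(n) {
  X1 <- 10 + 2 * rnorm(n)
  X2 <- 20 + 3 * rnorm(n)
  cor(X1, X2)
}

sizes <- c(100, 10000, 1000000)
r <- sapply(sizes, sample_cor)
print(data.frame(n = sizes, r = r))
```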

Create two correlated variables

This creates two highly correlated variables, X1 and X2, such that a simple regression analysis of X2 on X1 gives a slope of 3 and a Y-intercept of 40.

n <- 100
X1 <- 10 + 2 * rnorm(n)
Data <- as.data.frame(X1)
Data$X2 <- 3 * Data$X1 + 40 + 0.01 * rnorm(n)

The strength of the correlation is controlled by the coefficient "0.01" on the noise term. The higher the value, the lower the correlation. In this example, if you set it to "100", the correlation coefficient drops to roughly 0.06 in theory, although with 100 samples the observed value varies noticeably from run to run. Even with the same "100", the effect on the correlation depends on the spread of X1 and on the slope.
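
The dependence on the noise coefficient, the spread of X1, and the slope can be written down directly. Assuming the noise is independent of X1, the covariance of X1 and X2 is slope * var(X1), and the variance of X2 is slope^2 * var(X1) + var(noise), which gives the formula in the sketch below. The function name theoretical_cor is made up for this example.

```r
# Theoretical correlation between X1 and X2 = slope * X1 + intercept + noise,
# assuming the noise is independent of X1
theoretical_cor <- function(slope, sd_x, sd_noise) {
  slope * sd_x / sqrt(slope^2 * sd_x^2 + sd_noise^2)
}

# Settings from the example: slope 3, sd of X1 is 2
print(theoretical_cor(3, 2, 0.01))   # very close to 1
print(theoretical_cor(3, 2, 100))    # about 0.06
```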

Create sample data for multiple regression analysis

This is how to create data with the variables Y, X1, X2, and X3 such that, in a multiple regression analysis, the model formula "Y = X1 + X2" is the best and X3 does not need to be included.

n <- 100
X1 <- rnorm(n)
Data <- as.data.frame(X1)
Data$X2 <- rnorm(n)
Data$X3 <- rnorm(n)
Data$Y <- Data$X1 + Data$X2 + 0.01 * rnorm(n)
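
To check that the data behaves as intended, you can fit a model that includes all three explanatory variables and look at the coefficients. This is a minimal sketch with lm(); the seed is an arbitrary choice added for reproducibility.

```r
set.seed(1)
n <- 100
X1 <- rnorm(n)
Data <- as.data.frame(X1)
Data$X2 <- rnorm(n)
Data$X3 <- rnorm(n)
Data$Y <- Data$X1 + Data$X2 + 0.01 * rnorm(n)

# Fit the full model, including the unnecessary X3
fit <- lm(Y ~ X1 + X2 + X3, data = Data)

# Coefficients of X1 and X2 should be near 1, X3 and the intercept near 0
print(coef(fit))
```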

Make a sequence of random numbers that follow a uniform distribution

Create random numbers from a uniform distribution in the range 0 to 1.

Multiply them by 100 to get the range 0 to 100.

X1 <- runif(100)
Data <- as.data.frame(X1)
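
Instead of multiplying afterwards, the range can also be given to runif() through its min and max arguments. A minimal sketch, with an arbitrary seed added for reproducibility:

```r
set.seed(1)

# 100 uniform random numbers in the range 0 to 100
X1 <- runif(100, min = 0, max = 100)
Data <- as.data.frame(X1)

print(range(Data$X1))
```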

Create a column of variables with evenly spaced numbers

Random numbers with a uniform distribution are not evenly spaced.

The following method is for cases such as making the variable on the horizontal axis of a scatter plot evenly spaced.

This example divides the range 0 to 1 into 10 equal parts to make 11 samples, giving numbers in 0.1 increments.

n <- 11
Xmin <- 0
Xmax <- 1
X1 <- 0
Data <- as.data.frame(X1)
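for (i in 0:(n - 1)) {  # the original 1:n-1 means (1:n)-1, that is 0 to n-1
  Data[1 + i, 1] <- (Xmax - Xmin) / (n - 1) * i + Xmin  # row i+1 gets the i-th grid value
}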
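
The same evenly spaced column can also be made without a loop by using seq() with length.out, which is the usual vectorized way in R. This is offered as an alternative sketch, not as a replacement for the method above.

```r
n <- 11
Xmin <- 0
Xmax <- 1

# n evenly spaced values from Xmin to Xmax, both ends included
X1 <- seq(Xmin, Xmax, length.out = n)
Data <- as.data.frame(X1)

print(Data)
```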

Save the sample data as a csv file

The sample code in Data Analysis by R, as well as R-EDA1 and R-QCA1, takes a csv file as input data for the sake of versatility.

In the following, it is assumed that the sample data held in R under the name "Data" is saved as "Data.csv" in the folder "Rtest" directly under the C drive.

If there is no sample name

write.csv(Data, "C:/Rtest/Data.csv", row.names = FALSE)

If the data frame does not have sample names, save it with "row.names = FALSE".

If you set it to TRUE, the csv file will contain the row numbers in the first column. If you also want to use the row numbers in your analysis, set it to TRUE.

If there is a sample name

write.csv(Data, "C:/Rtest/Data.csv", row.names = TRUE)

For example, in the case of mtcars, the car name is a sample name, not a variable.

If you set "row.names = FALSE", a file without sample names will be created.
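
To confirm that the sample names survive the round trip, you can read the file back with read.csv() and tell it that the first column holds the row names. The sketch below writes to a temporary folder instead of "C:/Rtest" so that it runs on any machine; the names csv_path and Data_back are chosen for this example.

```r
Data <- mtcars

# Temporary location used only for this example
csv_path <- file.path(tempdir(), "Data.csv")

# Save with sample names in the first column
write.csv(Data, csv_path, row.names = TRUE)

# Read back, using the first column as row names
Data_back <- read.csv(csv_path, row.names = 1)
print(head(rownames(Data_back)))
```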